A Kernel Independence Test for Geographical Language Variation

Authors

  • Dong Nguyen
  • Jacob Eisenstein
Abstract

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real datasets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a dataset of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
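As a rough illustration of the kind of test the abstract describes, the sketch below computes one RKHS-based dependence statistic, the Hilbert-Schmidt independence criterion (HSIC), directly from individual geotagged observations, with no geographical binning. It should be read as a sketch under stated assumptions, not as the authors' implementation. The abstract does not specify the Gaussian kernels, the median-heuristic bandwidth, the Euclidean distance on coordinates, or the permutation-based p-value, and all function names here are hypothetical.

```python
import numpy as np


def gaussian_gram(x, bandwidth=None):
    """Gaussian (RBF) Gram matrix over the rows of x.

    Assumption: when no bandwidth is given, use the median of the
    nonzero pairwise distances (a common heuristic).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Squared Euclidean distances between all pairs of observations.
    # Assumption: for real coordinates a geodesic distance may be preferable.
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        positive = sq[sq > 0]
        bandwidth = np.sqrt(np.median(positive)) if positive.size else 1.0
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def hsic_permutation_test(locations, values, n_permutations=500, seed=0):
    """Biased HSIC estimate plus a permutation p-value.

    locations: (n, d) array of coordinates, one row per observation.
    values:    (n,) or (n, k) array holding the linguistic variable
               (binary indicator, frequency, etc.).
    """
    rng = np.random.default_rng(seed)
    K = gaussian_gram(locations)
    L = gaussian_gram(values)
    n = K.shape[0]

    # Center the location kernel once: Kc = H K H with H = I - (1/n) 11^T.
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H

    def statistic(L_matrix):
        # trace(K H L H) / n^2, written as an elementwise sum.
        return np.sum(Kc * L_matrix) / n ** 2

    observed = statistic(L)

    # Null distribution: shuffle which value goes with which location.
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        if statistic(L[np.ix_(perm, perm)]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Synthetic example: a binary variable that depends on the x-coordinate.
    rng = np.random.default_rng(1)
    loc = rng.uniform(size=(100, 2))
    val = (loc[:, 0] > 0.5).astype(float)
    stat, p = hsic_permutation_test(loc, val)
    print(f"HSIC = {stat:.4f}, permutation p-value = {p:.4f}")
```

Because the statistic depends only on the two Gram matrices, the same function accepts binary indicators or frequencies as `values`, which matches the abstract's aim of covering several types of linguistic data.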

Related resources

A Kernel Statistical Test of Independence

Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m...

The Application of Geographical Information System in Explaining Spatial Distribution of Low Birth Weight; a Case Study in North of Iran

Background: Geographical Information Systems are a new tool in environmental epidemiology that provide the opportunity to visualize and analyze spatial data. The aim of this study was to determine the geographic variation of low birth weight using a geographic information system in order to evaluate the efficacy of primary health care and the health information system. Methods: Low birth weight r...

Self-Discrepancy Conditional Independence Test

Tests of conditional independence (CI) of random variables play an important role in machine learning and causal inference. Of particular interest are kernel-based CI tests which allow us to test for independence among random variables with complex distribution functions. The efficacy of a CI test is measured in terms of its power and its calibratedness. We show that the Kernel CI Permutation T...

Independence Tests based on the Conditional Expectation

In this paper we propose a new procedure for testing independence of random variables, which is based on the conditional expectation. As is well known, the behaviour of the conditional expectation may determine a necessary condition for stochastic independence, namely the so-called mean independence. We provide a necessary and sufficient condition for independence in terms of conditional e...

A Wild Bootstrap for Degenerate Kernel Tests

A wild bootstrap method for nonparametric hypothesis tests based on kernel distribution embeddings is proposed. This bootstrap method is used to construct provably consistent tests that apply to random processes, for which the naive permutation-based bootstrap fails. It applies to a large group of kernel tests based on V-statistics, which are degenerate under the null hypothesis, and nondegener...

Journal:
  • Computational Linguistics

Volume 43  Issue 

Pages  -

Publication date: 2017